{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2023-12-06T14:58:50.418557Z", "iopub.status.busy": "2023-12-06T14:58:50.418350Z", "iopub.status.idle": "2023-12-06T14:58:50.436182Z", "shell.execute_reply": "2023-12-06T14:58:50.435663Z" } }, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Intro to accsr\n", "\n", "The goal of `accsr` is to simplify programmatic access to data on disc and in\n", "remote storage in python. We often found ourselves repeating the same lines of\n", "code for pulling something from a bucket, loading a file from a tar archive\n", "or creating a configuration module for storing paths to existing\n", "or to-be-loaded files. `accsr` allows doing all this directly from python,\n", "without relying on a cli or external tools.\n", "\n", "\n", "One of the design goals of accsr is to allow the users to use the same code\n", "for loading data and configuration, independently of the state of the local\n", "file system.\n", "\n", "For example, a developer with all data already loaded who wants to\n", "perform an experiment with some extended data set, would load\n", "the configuration with `get_config()`, instantiate a `RemoteStorage`\n", "object and call `pull()` to download any missing data from the remote storage.\n", "If no data is missing, nothing will be downloaded, thus creating no overhead.\n", "\n", "A user who does not have the data locally, would also call `get_config()`,\n", "(possibly using a different `config_local.json` file, with different access keys or namespaces),\n", "and then als call `pull()` with the same code. The data will be downloaded\n", "from the remote storage and stored locally.\n", "\n", "Thus, the code will never need to change between development, testing and deployment,\n", "and unnecessary overhead for loading data is reduced as much as possible.\n", "\n", "This approach also makes it easy to collaborate on data sets with the same code-base,\n", "and avoid stepping on each other's toes by accident." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The configuration module\n", "\n", "The configuration module provides utilities for reading configuration\n", "from a hierarchy of files and customizing access to them. Let us look at\n", "some use case examples for this." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2023-12-06T14:58:50.438606Z", "iopub.status.busy": "2023-12-06T14:58:50.438404Z", "iopub.status.idle": "2023-12-06T14:58:50.502932Z", "shell.execute_reply": "2023-12-06T14:58:50.502403Z" } }, "outputs": [], "source": [ "from accsr.config import ConfigProviderBase, DefaultDataConfiguration, ConfigurationBase\n", "from accsr.remote_storage import RemoteStorage, RemoteStorageConfig\n", "import os\n", "from pathlib import Path" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setting up configuration providers\n", "\n", "The recommended way of using `accsr`'s configuration utils is to create a module called `config.py` within your project\n", "and setup classes and methods for managing and providing configuration. In the cell below we show a minimal example\n", "of such a file.\n", "\n", "Under the hood the config provider is reading out the `__Configuration` class from generics at runtime and makes sure\n", "that only one global instance of your custom `__Configuration` exists in memory. Don't worry if you are unfamiliar\n", "with the coding patterns used here - you don't need to understand them to use the config utils.\n", "You will probably never need to adjust the `ConfigProvider` related code." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2023-12-06T14:58:50.505486Z", "iopub.status.busy": "2023-12-06T14:58:50.505280Z", "iopub.status.idle": "2023-12-06T14:58:50.517156Z", "shell.execute_reply": "2023-12-06T14:58:50.516659Z" } }, "outputs": [], "source": [ "class __Configuration(ConfigurationBase):\n", " pass\n", "\n", "\n", "class ConfigProvider(ConfigProviderBase[__Configuration]):\n", " pass\n", "\n", "\n", "_config_provider = ConfigProvider()\n", "\n", "\n", "def get_config(\n", " reload=False, config_files=(\"config.json\", \"config_local.json\")\n", ") -> __Configuration:\n", " \"\"\"\n", " :param reload: if True, the configuration will be reloaded from the json files\n", " :param config_files: the list of files to load the configuration from\n", " :return: the configuration instance\n", " \"\"\"\n", " return _config_provider.get_config(reload=reload, config_files=config_files)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading configuration from files\n", "\n", "We found the following workflow useful for managing configuration files:\n", "\n", "1. Create a `config.json` file in the root of your project.\n", "This file should contain all the default configuration and be committed to version control.\n", "2. Create a `config_local.json` file in the root of your project with the user-specific configuration.\n", "This file should not be committed to version control. It does not need to contain all the configuration,\n", "only the parts that are different from the default configuration.\n", "\n", "**NOTE**: Yaml files are also permitted, but by default the configuration is read from the two json files\n", "mentioned above. You can freely mix yaml and json and define your own hierarchy by passing `config_files`,\n", "so for example passing `config_files=(\"config.json\", \"config_local.yaml\")` is allowed.\n", "\n", "A typical use case is to have default configuration for the `RemoteStorage` in `config.json` and\n", "to have secrets (like the access key and secret), as well as a user-specific base path in `config_local.json`.\n", "In this way, multiple users can use the same code for loading data while still being able to experiment\n", "on their own data sets - for example storing these data sets in the same bucket but in different namespaces.\n", "\n", "Another use case is to include a read-only access key in `config.json`, which is then\n", "distributed to users in version-control, and a read-write access key in `config_local.json` for\n", "the developers who need to update data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Including environment variables\n", "\n", "One can tell the configuration to read the value off an environment variable instead of writing\n", "the value directly to the file. This is useful for example for running code in CI, where\n", "it might be easier to adjust environment variables instead of files (for example, while\n", "Gitlab CI offers file-type secrets, there is no such feature in GitHub actions at the time of writing).\n", "\n", "For instructing to read off the value from the env, simply prepend \"env:\" to the configured value,\n", "e.g. if your `config.json` looks as\n", "\n", "\n", "```json\n", "{\n", " \"configured_val\": \"fixed_value\",\n", " \"from_env_var\": \"env:MY_ENV_VAR\"\n", "}\n", "```\n", "\n", "then implementing the configuration as\n", "\n", "```python\n", "class __Configuration(ConfigurationBase):\n", " @property\n", " def configured_val(self) -> str:\n", " return self._not(\"configured_val\")\n", "```\n", "\n", "will result in the value of the property being read at **runtime** from the environment variable `MY_ENV_VAR`.\n", "Thus, changing the value of the environment variable will change the value of the property. This is in contrast\n", "to not-env-var values, which are read at config-loading time and will only change when the config is reloaded." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Default Configurations\n", "\n", "`accsr` includes a default implementation of the `ConfigurationBase` class meant for typical ML and data-driven\n", "projects. To use this, simply inherit from `DefaultDataConfiguration` instead of `ConfigurationBase`. The\n", "resulting configuration class will have some default properties and methods for managing paths to data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The RemoteStorage facilities\n", "\n", "`accsr` makes it easy to interact with data stored in a remote blob storage, like S3, Google Storage,\n", "Azure Storage or similar. The `RemoteStorage` implements a git-like logic and uses `apache-libcloud`\n", "underneath." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to demonstrate the RemoteStorage functionality, we will start [minIO](https://min.io/),\n", "an object store with S3 interface, using docker compose.\n", "We also switch to the tests directory where the docker-compose file and some resource files for\n", "testing have been prepared." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2023-12-06T14:58:50.519990Z", "iopub.status.busy": "2023-12-06T14:58:50.519532Z", "iopub.status.idle": "2023-12-06T14:58:50.530505Z", "shell.execute_reply": "2023-12-06T14:58:50.529851Z" } }, "outputs": [], "source": [ "notebooks_dir = Path(os.getcwd()).absolute()\n", "tests_dir = notebooks_dir.parent / \"tests\" / \"accsr\"\n", "\n", "os.chdir(tests_dir)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2023-12-06T14:58:50.532958Z", "iopub.status.busy": "2023-12-06T14:58:50.532573Z", "iopub.status.idle": "2023-12-06T14:58:50.544591Z", "shell.execute_reply": "2023-12-06T14:58:50.544085Z" } }, "outputs": [], "source": [ "if not os.getenv(\"CI\"):\n", " # In CI, we start the minIO container separately\n", " !docker-compose up -d\n", " host = \"localhost\"\n", "else:\n", " host = \"remote-storage\"\n", "\n", "port = 9001\n", "api_port = 9000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now should have minio up and running." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can instantiate a RemoteStorage object and interact with minIO." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2023-12-06T14:58:50.547162Z", "iopub.status.busy": "2023-12-06T14:58:50.546792Z", "iopub.status.idle": "2023-12-06T14:58:50.557615Z", "shell.execute_reply": "2023-12-06T14:58:50.557120Z" } }, "outputs": [], "source": [ "remote_storage_config = RemoteStorageConfig(\n", " provider=\"s3\",\n", " key=\"minio-root-user\",\n", " secret=\"minio-root-password\",\n", " bucket=\"accsr-demo\",\n", " base_path=\"my_remote_dir\",\n", " host=host,\n", " port=api_port,\n", " secure=False,\n", ")\n", "\n", "storage = RemoteStorage(remote_storage_config)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `base_path` is a \"directory\" (or rather a namespace) within the bucket.\n", "All calls to the storage object will only affect files in the `base_path`.\n", "\n", "The bucket itself does not exist yet, so let us create it.\n", "This has to be done by the user explicitly, to prevent accidental costs. Of course,\n", "if the configuration is pointing to an existing bucket, this step is not necessary." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2023-12-06T14:58:50.560147Z", "iopub.status.busy": "2023-12-06T14:58:50.559766Z", "iopub.status.idle": "2023-12-06T14:58:50.578913Z", "shell.execute_reply": "2023-12-06T14:58:50.578335Z" } }, "outputs": [], "source": [ "storage.create_bucket()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can push, pull, list and generally interact with objects inside `base_path` within the bucket.\n", "Let us first push the resources directory to have something to start with." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `pull` and `push` commands will return a summary of the transaction with the bucket.\n", "If the flag `dryrun=True` is specified, then the transaction is only computed but not\n", "executed - a good way to make sure that you are doing what is desired before actually\n", "interacting with data." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2023-12-06T14:58:50.581591Z", "iopub.status.busy": "2023-12-06T14:58:50.581210Z", "iopub.status.idle": "2023-12-06T14:58:50.662252Z", "shell.execute_reply": "2023-12-06T14:58:50.661468Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\r", "Scanning files in /__w/accsr/accsr/tests/accsr/resources: 0%| | 0/11 [00:00